1.1 Background
This report examines the impact of AI-assisted coding tools on coding discussion forums. Large language models (LLMs) have rapidly gained popularity and become integrated into daily life. In particular, programmers now frequently use AI-assisted coding tools such as ChatGPT, Gemini Code Assist, and GitHub Copilot in their workflows. The capabilities of these AI tools drew widespread attention when ChatGPT launched in November 2022, reaching over 100 million users within two months (Wikipedia contributors, 2024). To capture the short-term trend of this development, we restricted the analysis timeframe to 2021–2025, allowing for a comparison before and after ChatGPT’s release.
1.2 Dataset
The analysis focuses on Stack Overflow’s public dataset, available through the Stack Exchange API. Stack Overflow is one of the largest question-and-answer platforms for programmers, with over 23 million users and nearly 60 million posts as of March 2024 (Wikipedia contributors, 2024). The API provides access to contents and interactions on Stack Overflow through entities such as Answers, Badges, and Comments (Stack Exchange, n.d.). For this study, we specifically examine the Posts, Tags, and Users data. Posts include both questions and answers, while Tags are the topics to categorize and organize content (Stack Overflow Teams, 2024). These data offer valuable insights into user engagement, content quality, and the evolving landscape of Stack Overflow over time.
1.3 Objective
To investigate the impact of AI-assisted coding tools on coding discussion forums, we focus on two key questions: (1) has ChatGPT’s launch reduced programmers’ reliance on Stack Overflow, and (2) has it changed the structure and nature of the content posted there?
By analyzing these aspects, we aim to infer the effects of ChatGPT’s launch on Stack Overflow’s users and content, providing a preliminary answer to the research question. (See project repository for data, code, and additional materials.)
2.1 Acquire Data
Stack Overflow’s public data was retrieved using the Stack Exchange API v2.3. Since anonymous API users face a daily quota limit, we registered for an API key (StackApps, n.d.). As instructed in the documentation, custom filters were created through the API to extract datasets with specific attributes. Once the desired filter was generated as an encoded string, we passed it, along with the order, sort, site, fromdate, todate, page, and pagesize parameters, to the API request. To ensure an even distribution of posts from 2021 to 2025, we generated a sequence of start and end dates for each month between 2021-01-01 and 2025-01-01 and made a separate request for each month. We fixed pagesize at 100 and merged the retrieved monthly data into a balanced dataset over the timeframe. For user data, we used a timeframe from 2021-01-01 to 2024-04-01, constrained by API quota limits. For tags, since the Stack Exchange database contains approximately 65,000 tags, we retrieved the entire dataset without filtering. In total, we collected three datasets (posts, users, and tags) amounting to approximately 97 MB. The data was stored as CSV files for easy reuse. The retrieval process was implemented using the httr package in R.
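A sketch of the monthly retrieval loop is shown below. The API key and filter string are hypothetical placeholders, and the single-page request omits the pagination over subsequent pages that a full run would require:

```r
library(httr)

api_key <- "YOUR_KEY"      # hypothetical; registered via StackApps
filter  <- "FILTER_STRING" # hypothetical; generated via the /filters/create route

# Retrieve one page of posts for a single month window
fetch_month <- function(from, to) {
  resp <- GET(
    "https://api.stackexchange.com/2.3/posts",
    query = list(
      key      = api_key,
      site     = "stackoverflow",
      order    = "desc",
      sort     = "creation",
      fromdate = as.integer(as.POSIXct(from, tz = "UTC")),
      todate   = as.integer(as.POSIXct(to,   tz = "UTC")),
      page     = 1,
      pagesize = 100,
      filter   = filter
    )
  )
  stop_for_status(resp)
  content(resp, as = "parsed")$items
}

# One request per month between 2021-01-01 and 2025-01-01
starts  <- seq(as.Date("2021-01-01"), as.Date("2024-12-01"), by = "month")
ends    <- seq(as.Date("2021-02-01"), as.Date("2025-01-01"), by = "month")
monthly <- Map(fetch_month, starts, ends)
```

The fromdate/todate parameters take Unix epoch seconds, hence the conversion from dates.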
2.2 Data Cleaning and Wrangling
To ensure the dataset was in a usable format, we first inspected and
summarized each dataset. We found that 93.77% of location
data in the users dataset and 0.94% of owner_id data in the
posts dataset were missing, while all other variables had complete data.
Given the focus of this analysis, we omitted the location
variable but retained owner_id. Next, we reviewed all
variables, removing irrelevant ones and renaming the rest for clarity.
Exploratory visualizations revealed highly correlated and duplicate
variables, which we also removed. Additionally, we identified two
article entries in the posts dataset that were neither questions nor
answers and excluded them. Based on the distribution and usage of tags,
we decided to retain only the 100 most popular tags for analysis.
We then consolidated variables into new variables tailored to our
question. We introduced lifespan, calculated as the time
gap between creation and last access/modification dates in both the
users and posts datasets. Based on the distribution of lifespan,
reputation, and total badge counts, we categorized users into three
engagement levels: “Expert”, “Experienced”, and “Normal”. Similarly, we
classified posts into three engagement levels based on lifespan, vote
counts, and score: “High”, “Moderate”, and “Low”. We also introduced a
quality variable for posts, categorizing them as: “Good”,
“Normal”, “Bad”, and “Controversial” based on the total score and the
ratio between upvotes and downvotes. Lastly, we joined the users and
posts datasets by matching the post owner’s user ID, creating a fourth
dataset that facilitates the analysis of relationships between users and
their posts. The cleaning and wrangling procedures were carried out
using R packages including tidyverse, dplyr,
and data.table.
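The lifespan derivation and user categorization can be sketched as follows. The cutoff values here are illustrative assumptions, since the real thresholds were chosen from the observed distributions:

```r
library(dplyr)

# Toy rows standing in for the cleaned users dataset
users <- tibble(
  creation_date    = as.Date(c("2021-01-05", "2021-06-01", "2022-03-10")),
  last_access_date = as.Date(c("2024-01-05", "2021-06-02", "2023-03-10")),
  reputation       = c(15000, 1, 420),
  badge_total      = c(120, 0, 12)
)

users <- users %>%
  mutate(
    # Days between account creation and last access
    lifespan = as.numeric(last_access_date - creation_date),
    # Illustrative cutoffs only; the real ones came from the data
    engagement = case_when(
      reputation >= 10000 & badge_total >= 100 ~ "Expert",
      reputation >= 100   | badge_total >= 10  ~ "Experienced",
      TRUE                                     ~ "Normal"
    )
  )
```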
2.3 Natural Language Processing (NLP)
The text content of posts was also classified into new categorical
variables. We identified error-related keywords (e.g., segmentation
fault), actionable language (e.g., fix), and contextual phrases (e.g.,
ASAP), combining these into a score to categorize a post’s
intention as either debugging or discussion. Similarly, we
developed a complexity score based on posts’ lengths,
markdown structure, number of technical terms, and vocabulary diversity,
classifying posts as either basic or complex. In addition, we tokenized
the post body and assigned one of the 100 most popular tags by matching
tokens to tags and selecting the highest-frequency match. To identify
whether a post was related to AI, we also applied Latent Dirichlet
Allocation (LDA) topic modeling, with methodological details provided in
section 2.5.1.
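A minimal sketch of the keyword-based intention scoring, assuming abbreviated keyword lists and a hypothetical threshold of two matches (the real lists and weighting were more extensive):

```r
library(stringr)

# Illustrative keyword lists; the actual lists were broader
error_kw   <- c("segmentation fault", "error", "exception", "traceback")
action_kw  <- c("fix", "resolve", "debug")
context_kw <- c("asap", "urgent", "help")

# Count keyword hits and classify the post's intention
score_intention <- function(body) {
  text <- str_to_lower(body)
  hits <- sum(str_count(text, fixed(error_kw))) +
          sum(str_count(text, fixed(action_kw))) +
          sum(str_count(text, fixed(context_kw)))
  if (hits >= 2) "debugging" else "discussion"
}

score_intention("Segmentation fault in my C code, please help me fix it ASAP")
# "debugging"
```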
Given the limitations of manual classification, we further utilized
several machine learning models—Decision Tree, Random Forest, Bagging,
Boosting, and XGBoost—to enhance the classification process. To prepare
the post content for these models, we generated word embeddings by first
constructing a term co-occurrence matrix, which captured the contextual
proximity of words within a fixed window. This matrix served as input
for the Global Vectors for Word Representation (GloVe) algorithm, which
produced 50-dimensional semantic vectors for each word. Post-level
embeddings were then computed by averaging the vectors, creating numeric
feature vectors suitable for classification. To accelerate this
computationally intensive task, we implemented parallel processing using
high-performance computing (HPC) packages. The processed datasets were
saved as CSV and RDS files for efficient reuse. This NLP pipeline relied
on stringr, topicmodels,
tokenizers, and text2vec, along with
furrr, future, and parallel for
distributed processing.
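A condensed version of this embedding pipeline, run with text2vec on a toy two-post corpus:

```r
library(text2vec)

bodies <- c("how to fix null pointer exception in java",
            "python list comprehension example")

# Tokenize and build the vocabulary
it    <- itoken(bodies, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)
vect  <- vocab_vectorizer(vocab)

# Term co-occurrence matrix with a 5-word context window
tcm <- create_tcm(it, vect, skip_grams_window = 5)

# Fit GloVe to obtain 50-dimensional word vectors
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv    <- glove$fit_transform(tcm, n_iter = 10)
wv    <- wv + t(glove$components)  # combine main and context vectors

# Post-level embedding: average the vectors of the post's tokens
embed_post <- function(body) {
  toks <- intersect(word_tokenizer(body)[[1]], rownames(wv))
  colMeans(wv[toks, , drop = FALSE])
}
```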
2.4 Data Visualization
For exploratory data analysis and visually evident statistical
modeling, we employed a range of visualizations: violin plots, box
plots, and barcode plots to depict distributions; scatter plots and
correlation heatmaps to demonstrate correlation; bar plots to present
rankings; and circle timeline plots, area charts, and calendar heatmaps
to visualize change over time. Several of these graphics were made
interactive, allowing filtering, selecting, and animating data across
time, enhancing interpretability and engagement. The visualizations were
developed using R packages including ggplot2,
ggcorrplot, plotly, and
gridExtra. The choice and design of visuals were inspired
by the Financial Times’ Visual Vocabulary (Financial Times Visual
Vocabulary, n.d.) and Andy Kriebel’s Tableau workbook (Kriebel,
n.d.).
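For illustration, a static ggplot2 chart can be made interactive with a single ggplotly() call; the monthly counts below are simulated:

```r
library(ggplot2)
library(plotly)

# Simulated monthly post counts for illustration only
set.seed(42)
df <- data.frame(
  month = seq(as.Date("2021-01-01"), as.Date("2024-12-01"), by = "month"),
  posts = rpois(48, lambda = 5000)
)

p <- ggplot(df, aes(month, posts)) +
  geom_area(fill = "steelblue", alpha = 0.6) +
  geom_vline(xintercept = as.numeric(as.Date("2022-11-01")),
             linetype = "dashed") +  # ChatGPT launch
  labs(title = "Monthly post volume", x = NULL, y = "Posts")

ggplotly(p)  # adds hover, zoom, and filtering interactivity
```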
2.5 Statistical Modeling
2.5.1 Classification for NLP
As outlined in section 2.3, a range of classification models was used to analyze and categorize post content. A brief overview of each method is provided below:
LDA was applied to extract latent topics from post bodies to
determine whether a post was AI-related. For classifying post
characteristics such as engagement, quality, complexity, and intention,
we evaluated the performance of Decision Tree, Random Forest, Bagging,
Boosting, and XGBoost models. Among these, the Boosting model achieved
the best performance, with the lowest Root Mean Squared Error (0.3603)
and the highest R-squared value (0.3000). This final model was
configured with a Gaussian distribution, 1,000 boosting rounds, a
learning rate of 0.5, and five-fold cross-validation. Implementation of
these models was supported by the rpart,
randomForest, gbm, xgboost, and
caret packages in R.
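The reported final configuration can be reproduced with gbm roughly as below; the response and feature names are placeholders fitted here to toy data:

```r
library(gbm)

# Toy stand-ins for the engineered feature matrix; the response is the
# numeric target being predicted
set.seed(1)
n  <- 300
df <- data.frame(y = rnorm(n), votes = rnorm(n), lifespan = rnorm(n))
train_df <- df[1:200, ]
test_df  <- df[201:300, ]

# Configuration described above: Gaussian loss, 1,000 rounds,
# learning rate 0.5, five-fold cross-validation
fit <- gbm(
  y ~ .,
  data         = train_df,
  distribution = "gaussian",
  n.trees      = 1000,
  shrinkage    = 0.5,
  cv.folds     = 5
)

# Select the CV-optimal number of trees before predicting
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)
preds     <- predict(fit, newdata = test_df, n.trees = best_iter)
```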
2.5.2 Model Selection
To address the report’s core research questions, we transformed the
creation timestamps of posts and users into daily intervals to
facilitate time series analysis of posting and user activity trends.
Alongside this continuous time predictor, we created a binary
categorical variable, turnpoint, to indicate whether each
observation occurred before or after the launch of ChatGPT (2022-11-01),
enabling inferential comparisons. To model changes in post counts over time, we used models based on the Negative Binomial distribution with a log link function to account for potential over-dispersion in count data. The models considered included a Negative Binomial GLM and GAM for overall post counts, along with their mixed-effects extensions (GLMM and GAMM) treating tags as random effects.
To analyze the distribution of users across engagement levels, we
treated user proportions as a compositional response and applied
Dirichlet Regression. This model accommodates
multivariate proportion data that sum to one and captures dependencies
among components while modeling their relationships with covariates. For
modeling the ratio of complex to basic posts, we considered both a
Binomial GAM with a log-odds link function and a Beta
Regression model. The latter offers flexibility in modeling
proportional data constrained between 0 and 1. Across all GAM models, we
used 30 basis functions (k = 30) to balance smoothness and model
flexibility. Model fitting was performed using a suite of R packages,
including mgcv, glmmTMB,
DirichletReg, and betareg.
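A minimal version of the Negative Binomial GAM with the turnpoint indicator, fit here to simulated daily counts:

```r
library(mgcv)

# Simulated daily counts; the real data are daily post counts, 2021-2025
set.seed(3)
days <- seq(as.Date("2021-01-01"), as.Date("2024-12-31"), by = "day")
df <- data.frame(
  day       = as.numeric(days - min(days)),
  turnpoint = factor(days >= as.Date("2022-11-01"),
                     labels = c("before", "after")),
  count     = rnbinom(length(days), mu = 500, size = 5)
)

# Negative Binomial GAM with log link and k = 30 basis functions
fit <- gam(count ~ turnpoint + s(day, k = 30),
           family = nb(link = "log"), data = df)
summary(fit)$dev.expl  # proportion of deviance explained
```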
2.5.3 Test Statistics
Model evaluation was conducted using a range of statistical tests and performance metrics, including p-values, deviance explained, and pseudo R-squared.
While most statistics are provided automatically via R’s
base::summary() function, the pseudo R-squared was obtained
using the pscl package.
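For instance, a fitted GLM's pseudo R-squared can be extracted as follows (toy Poisson example):

```r
library(pscl)

# Toy Poisson GLM; pR2() reports several pseudo R-squared measures,
# including McFadden's
set.seed(7)
df   <- data.frame(x = rnorm(100))
df$y <- rpois(100, lambda = exp(0.5 + 0.3 * df$x))
fit  <- glm(y ~ x, family = poisson, data = df)

pR2(fit)["McFadden"]  # McFadden's pseudo R-squared
```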
3.1 Exploratory Data Analysis
3.1.1 Post
Most of the variables measuring engagement and quality in the users and posts datasets exhibit strong skewness. Notably, 45.16% of posts are active for less than 24 hours, while 9.07% are active for more than a week. As illustrated in Figure 1, both distributions have a long, thin right tail, emphasizing this skewness.
Only 5.65% of posts are controversial, meaning they receive a
significant number of both upvotes and downvotes. While one might expect
a negative correlation between upvotes and downvotes, Figure 2 suggests
otherwise. The line of best fit indicates that posts with more upvotes
tend to also receive more downvotes, possibly due to the polarizing
nature of controversial posts.
The skewness could be explained by the possibility that a small
proportion of basic, broadly useful questions generate high engagement,
while the majority of questions are niche-specific and engage only a
small subset of users. Additionally, controversial posts may attract
high engagement due to their inherently divisive nature.
3.1.2 User
A similar skewed distribution is observed in the users dataset. Users are far more likely to upvote posts as a form of agreement or encouragement, while only 0.04% have ever downvoted a post. Badge counts also reflect this imbalance: up to the third quartile, users hold zero silver or gold badges, as well as zero upvotes or downvotes. User reputation is extremely skewed, as shown in Figure 3: the minimum, median, and third quartile are all 1, while the maximum reaches 4,341.
Figure 3 shows that the density distribution of change in
reputation over months nearly coincides with weekly changes, and
similarly, quarterly changes align with yearly trends. This is supported
by Figure 4, where the correlation matrix shows that reputation changes
over different time intervals are highly correlated (>0.95).
Additionally, reputation is strongly correlated with answer count (0.85)
and silver badge count (0.68).
These findings suggest that a small percentage of users actively
engage in discussions, while the majority passively browse and search
for information.
3.1.3 Tag
The tags dataset is straightforward—the top 20 tags correspond to widely used programming languages and concepts. JavaScript ranks first, followed by Python and Java.
3.1.4 Classification
As outlined in the method section, relying solely on manually defined
metrics to classify post content can be limiting. We ran classification
with the best-performing boosting model to gain deeper insights. Figure
6 presents the top five most influential features for each
classification. Notably, the total number of votes and the active
lifespan of a post emerged as strong indicators for predicting
engagement level, while the number of downvotes played a significant
role in determining content quality. Additionally, word embeddings
(dim_#) helped distinguish between debugging and discussion
posts, and the inferred intention also contributed to identifying its
complexity.
3.2 Statistical Modeling
3.2.1 Impact on Reliance
Recall we are interested in whether ChatGPT’s launch reduced
programmers’ reliance on Stack Overflow. Figure 7 shows a clear decline
in basic question posts since 2022. The GAM suggests that time and
complexity significantly impacted post counts (\(p < 2e-16\)), explaining 91.4% of the
deviance. Including the interaction between turnpoint and
complexity improved the GLM, achieving a pseudo R-squared of 0.9611
(\(p < 2e-16\)). These results
suggest a decrease in basic coding questions, possibly due to the use of
AI tools instead of forums for basic coding help.
A clear trend emerges when examining the complexity of posts.
Figure 8 shows a consistent rise in the ratio of complex to basic posts.
The Beta regression model indicates that both the time variable (\(p \approx 0.0005\)) and the
turnpoint (\(p <
2e-16\)) significantly influenced this ratio, with the model
explaining 71.67% of the deviance. These findings suggest that users are
increasingly turning to forums for more complex coding inquiries and
responses, possibly leaving simpler tasks to LLM coding assistance.
We also hypothesized that AI tools might reduce contributions
from normal users. As shown in Figure 9, post activity among normal and
experienced users rose from 2022 through mid-2023, followed by a
noticeable drop, while expert users exhibited a steady decline in
posting since 2021. This is contrary to our hypothesis. The GLM found
that after ChatGPT’s launch (\(p <
2e-16\)), normal user status (\(p
\approx 0.018\)), and their interaction term (\(p \approx 0.0407\)) were all significant
predictors of post counts. The GAM explains 69.8% of the deviance,
outperforming the GLM in terms of pseudo R-squared. Note these results are based on the
merged dataset instead of full datasets, which may introduce sample bias
due to its smaller size.
Additionally, we explored whether the composition of user types
has shifted. Figure 10 shows the share of experienced and normal users
has steadily increased, while the share of expert users has declined
drastically. The Dirichlet regression model confirms the significance of
turnpoint (\(p \approx 5.85e-7; p
\approx 5.8e-13\)) and the time variable (\(p \approx 0.0080; p \approx 0.0113\)) in
impacting experienced and normal user shares, respectively. The
turnpoint also significantly impacted the share of expert
users (\(p \approx 0.0005\)),
underscoring the influence of AI tools in reshaping platform
participation. The model achieved a strong McFadden’s pseudo R-squared
of 0.3675.
In summary, AI-assisted coding tools appear to have reduced
reliance on coding forums among beginner programmers asking basic
questions. At the same time, the engagement of advanced users has
continued to decline.
3.2.2 Impact on Content Structure
Beyond engagement and reliance, we are also interested in examining whether LLM coding assistance has influenced the nature of content posted on coding forums. Figure 11 compares debugging and discussion-oriented posts, highlighting that the number of normal-quality debugging posts has slightly increased since 2023 (\(p < 2e-16\)), while good-quality discussion posts have declined over the same period (\(p < 1.84e-6\)). Other content categories remained relatively unchanged. The GAM including the interaction between post quality and intention explained 97.6% of the deviance in post volume.
To further assess how post wording may have evolved, we
projected word embeddings of post bodies into two dimensions. While this
visualization does not provide direct semantic interpretation, it
reveals visible shifts in clustering patterns over time, suggesting
structural changes in how users compose their posts.
Figure 13 examines the impact of AI tools on posts in the top
10% of popular tags. It shows a significant
decline in posts related to basic concepts such as lists and functions,
while posts tagged with more complex topics like APIs have surged. The
GLMM revealed that post timing relative to ChatGPT’s launch
significantly contributed to the difference in post counts (\(p < 2e-16\)), while the GAMM emphasized
the importance of the random effects of tags (\(p < 2e-16\)), explaining 86.8% of the
deviance. These results suggest that while AI tools are increasingly
used for routine coding issues, users continue to seek forum support for
more advanced or nuanced programming topics.
In conclusion, the emergence of AI-assisted coding tools appears
to encourage programmers to post more complex questions and discussions,
rather than basic inquiries.
4.1 Limitations
While this study provides meaningful insights into the evolving landscape of Stack Overflow usage, several limitations should be considered. We also propose potential directions for improvement:
Manual Scoring for Classification
The classification framework relied on manually defined scoring rules to assess post complexity and intention. Although machine learning models were introduced to mitigate bias, more advanced NLP methods, such as zero-shot learning using transformer-based models or label-embedding approaches like Lbl2Vec, could yield more accurate and less biased results.
Imbalance in AI-Relevant Content Detection
Topic modeling via LDA was used to identify AI-relevant posts, but only two topics were categorized as such, resulting in a highly imbalanced dataset. Future work should consider collecting a larger volume of posts via the Stack Exchange API and applying more robust classification methods, allowing us to explore AI’s influence on content creation as well.
Lack of Direct AI Tool Usage Data
This study inferred the impact of AI tools from trends in posting behavior, but it did not directly measure AI tool usage. To conclude a causal relationship, future research should incorporate data on individual users’ interactions with AI tools. Causal inference methods such as randomized controlled trials (RCTs) or quasi-experimental designs could then be employed for more rigorous evaluation.
By addressing these limitations, future studies can strengthen the validity and generalizability of findings, ultimately offering a clearer picture of how AI-assisted tools are transforming engagement and content on coding forums.
4.2 Discussion
Despite its limitations, this analysis offers valuable insights into how AI-assisted coding tools may be influencing engagement on platforms like Stack Overflow. The findings reveal a strong correlation between posting behavior and time, particularly around the release of ChatGPT in November 2022. Key shifts include a noticeable decline in basic question posts, a rising ratio of complex to basic posts, and evolving trends in tag-specific post counts. Interestingly, the decline in user share is most pronounced among expert users, while the shares of experienced and normal users have increased. These patterns suggest that beginner programmers may be turning to AI tools for quick solutions, reducing their reliance on forums for basic questions. Meanwhile, more advanced users might be using forums to engage in deeper, more complex discussions, potentially benefiting from AI assistance as a complement rather than a replacement. Overall, these trends point to AI coding assistance tools as a likely driving factor behind the observed changes in content and user behavior on coding forums.